2024-12-24 de novo prelim

Introduction

3 days ago i was looking through the gtdbtk manual and saw that de_novo_wf was an option for analysis to create the trees, from the description given:

knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")

i beleived this would be something i should do as it might produce more accurate trees. sample 1Dt2d Enterobacter cancerogenus had been placed by the classify_wf in the previous gtdbtk analysis in the genus Pantoea, which lead me to this search. after a bit of trial and error, i produced this script

Methods

This ran as a slurm job on hawk (SCW) from rougly 20:10 on the 23rd to 01:00 on the 24th, totalling 4 hours and 50 minutes. The main parameters that i experimented with were

- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10

I settled on these as being the “best”, however, it is entirely possible that they could be more optimised.

Results

This analysis produced these files:

/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log

./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv

./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv

./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results

./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree

I then moved this gtdbtk.bac120.decorated.tree file into Dendroscope for review, all 10 are on one tree, but 1Dt2d is still being placed in the “wrong” genus. on review of its sister accession on the ncbi database.

Conclusion

On the NCBI page for the sister accession, can be found a CheckM analysis that comes back with

completeness: 90%
contamination: 3.6%
Taxonomy check status: failed

Upon viewing the tree in Dendroscope, the joining node has the label 0.968. This I believe to be the probability the relationship is correct. this implies they are the same species, and the online sample is also identified as Enterobacter cancerogenus. However, due to the checkm analysis i find it plausible that they both have been misidentified and are in reality Pantoea species, i find this the most parsimonious explanation. I will follow this up with a CheckM analysis of my own on 1Dt2d

Figure 1 - dendroscope screenshot showing location and relationship for 1Dt2d after de novo analysis
Figure 1 - dendroscope screenshot showing location and relationship for 1Dt2d after de novo analysis

This was a “technical spike” or proof of concept for de_novo_wf

📌 TODO: do another Checkm analysis on 1Dt2d to see if the values are similar to the online sample

2024-12-25 🎄 beginning of table, checkm

Introduction

i wanted to see if the outputs of checkm differed from checkm2 so ran that on hawk. I also began recreating the innital table for metadata about the bangor samples, in the spirit of automation, a less manual approach was chosen this time around.

Methods

using this script i ran a slurm job on hawk under the lineage_wf of CheckM for all 10 Bangor-made samples, this took just 4 minutes. I also worked on exporting the data i want to tabulate off of hawk. The past way i did this was by manually entering each file and noting down the important characteristics. However, because there are going to be more samples(and i wanted to be clever) i decided to use a more automated process. This was done by identifying different documents in the flye directories on hawk, specifically files called “assembly_info.txt” which contain the same information, but are vastly more exportable. These are stored here: cd /scratch/scw2160/02_outputs/flye_asm/flye_asm_[accession]/ using this script i exported them off of hawk.

Results

the CheckM analysis produced this output directory. My export script exported all 10 “assembly_info.txt” files to a directory in my home directory, as well as adding their accession ID to the name, this is important as otherwise i wouldnt know which belonged to what accession. I then brought them down and stored them here.

Conclusions

with it being christmas i did not take a serious look at the significance of the outputs of either, so that is what i plan to do next so that i can have some conclusions, maybe by the end of tomorrow. In conclusion, this process is only half done and will continue into the following entry(s).

📌 TODO: Complete analysis / processing of checkm and table stuff

2024-12-27 finishing the table and checkm stuff, maybe extra if there is time

Introduction

Basic stuff, finding out what i can do to the checkm analysis to turn it into publishable data, making the table using R scripts and the output files. exams in January are rapidly approaching, so i really do need to pick up the pace if i want to get this done before i begin revising. there is not a real goal for this exact piece, i just want to know how congruent the outputs from CheckM are to CheckM2.

Methods

Using the CheckM documentation. I found the output that i needed was bin_stats_ext.tsv. this is where the contamination(and a lot of other fun looking stats) are held. I can thusly import this file to R to try and extract usefull stuff / tabulate. Over the course of the day i developed and improved CheckMtablegenerator.R. which creates the table layed out in results below. This uses a few packages, one of which was new to me, gt. This is a very useful package for table generation, here is some info.

Results

Table 1: CheckM analysis

Table 1 - subset of data obtained from CheckM analysis
of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality.
Accession Marker lineage Completeness Contamination GC GC std Genome size contigs Longest contig Coding density Predicted genes
1Dt1h o__Actinomycetales 100.00 0.00 65.519% 4.284% 4,336,578 4 2,051,556 90.334% 3,818
1Dt2d f__Enterobacteriaceae 100.00 0.90 53.889% 1.270% 5,345,294 6 2,288,975 87.931% 4,937
1Dt100h o__Actinomycetales 98.84 0.19 52.446% 0.564% 2,379,473 2 2,349,045 89.586% 2,071
2Dt1l o__Actinomycetales 98.82 2.32 67.742% 0.844% 3,996,506 16 1,089,026 93.707% 3,915
3Dt1c o__Sphingomonadales 97.23 1.19 66.487% 1.385% 3,636,266 86 275,317 90.660% 3,433
3Dt2j o__Sphingomonadales 96.19 1.21 65.644% 1.752% 2,813,977 75 195,248 92.779% 2,786
3Dt2c o__Sphingomonadales 95.75 0.85 66.093% 1.604% 3,819,610 84 247,364 89.769% 3,516
3Dt2h o__Actinomycetales 87.98 0.51 70.327% 1.317% 3,006,227 166 125,244 90.324% 2,978
1Dt100g o__Sphingomonadales 76.77 0.00 68.105% 1.332% 2,104,071 153 74,994 92.645% 2,179
2Dt1e o__Actinomycetales 44.74 0.00 71.162% 1.565% 2,332,403 149 70,689 89.402% 2,215
Source: CheckM lineage_wf performed on 2024-12-25

As mentioned, this table is a subset of the data, there was more outputted by CheckM, some of which overlaps with other analyses, but i did not know if it was relevent to include, so left it out, however, creating a new one with the omitted fields in would not be hard. This table will pair with one produced from CheckM2, because from what i’ve seen one is kind of an improvement on the other, but they kind of do different things, its confusing and thus why i want to do the comparison for future reference. I did not know what counted as “significantly complete”, so arbitrarily placed the value at 90%. This analysis shows that 7 of the 10 samples are complete or near enough, with only 3 falling below my line. It should be noted that 2Dt1e, the lowest sample is only found to be 45% complete, this is a large outlier from the rest of the range. Contamination is fairly low in all, none above 3% and 3 samples on 0.00%.

Conclusion

The main purpose of doing this is to compare with Checkm2, so im not sure what conclusions i can get from just this, other than many of the samples look sufficiently complete and i learned a fair amount from this experience. I think this will act as a good first step to compare future analysis against, for confirmation, as many upcoming analyses will give repeat data in some areas. Some of the field, for example “GC” are odd and i dont quite understand their significance, perhaps Aaron can give more info when he gets to this part of the notebook.

📌 TODO: finish the straight flye analysis / start on CheckM2

2024-12-28 flye output tabulation and CheckM2 complete pipeline

Introduction

Today i wanted to finish off the work from yesterday and start on CheckM2. this should produce two tables that i can use to compare with the first one from Checkm, maybe find out which method is superior?

Methods

The order i did the two main focuses in shifted greatly during the day, with me bouncing between flye and CheckM2 analyses, so this section will have subheaders for better readability

Flye analysis 1: assembly_info.txt

I created an R script, “rawflyeanalysis.R”. Basic stuff, imports the text files i chose to download(i wonder if i could download the flye.log files and use a sub() to cut out the non-useful stuff). This then sorts and processes the data into the table seen below using a few packages, Table 2: Assembly_info.txt.

CheckM2: slurm job

Just before lunch(11:45-ish) i set off the slurm job to run CheckM2 on hawk. Let’s see how that goes after lunch, one thing of note though, i did make a minor modification, in the original file, i was running it on the fastas one at a time(?) weird, anyway, i changed it to work on the directory containing all 10, so it should do them all at once. run_checkm2TN.sh.

Flye analysis 2: flye.log

After lunch, firstly i went back to the flye.log analysis i wanted to do largely for posterity. This was horrifyingly complex due to the “complex” nature of the flye.log files. I did this by creating a modified copy of my assembly_info.txt file grabber shell script, called flye.loggetter.sh, using ctrl + h to find and replace instead of manually changing things. I added this to an existing output directory in my home section of hawk, which i exported using MobaXterm(do i need to include a section on how i use MobaXterm?) to my computer. This is the folder called flyestuff, which contains both the flye.log and the assembly_info.txt files for each accession, i grouped them as it seemed relevent for organisation and didnt hinder the coding process at all. The R file i used to tabulate this is flyeloganalysis.R. This produced Table 3: Flye.log

CheckM2: R table generation

This started with doing cd /scratch/scw2160/02_outputs/flye_asm/CheckM2_20241228 to see the outputs, the job took roughly half an hour, ending at 12:15 pm. Firstly, i copied the output directory to my home area, then once again used MobaXterm to bring it down to the git repo, here. Then, i began work on the R file to turn the quality_report.tsv file. This file was formatted very well, so turning it into a table was very simple, done in CheckM2tablegenerator.R. This produced Table 4: CheckM2

Note: there were other outputs to CheckM2, they were very interesting, im going to make the next section about exploring that possiblity, but they dont relate to this analysis.

Results

Table 2: Assembly_info.txt

Table 2 - Data obtained from flye outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Number of contigs Total length Largest fragment Mean fragment length
3Dt2h 166 3,006,227 125,244 18,109.80
1Dt100g 153 2,104,071 74,994 13,752.10
2Dt1e 149 2,332,403 70,689 15,653.71
3Dt1c 86 3,636,266 275,317 42,282.16
3Dt2c 84 3,819,610 247,364 45,471.55
3Dt2j 75 2,813,977 195,248 37,519.69
2Dt1l 16 3,996,506 1,089,026 249,781.62
1Dt2d 6 5,345,294 2,288,975 890,882.33
1Dt1h 4 4,336,578 2,051,556 1,084,144.50
1Dt100h 2 2,379,473 2,349,045 1,189,736.50
Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024
coverage could not be calculated from this method as no result matched what was found in the flye.log files

The basic flye output analysis produced this table, as mentioned, i couldn’t get coverage to match what was on the flye.log file that i got the original of this table from, it is possible that file is the one that is wrong, but genome coverage is kind of all over the place, so i just omitted it as i don’t think its one of the mandatory fields. I chose to sort the table by “Number of contigs” rather arbitrarily, but it does show something interesting in that why is there such a large range in fragment number?

Table 3: Flye.log

Table 3 - Data obtained from flye.log outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Fragments Total length Fragments N50 Largest fragment Scaffolds Mean coverage
3Dt2h 166 3,006,227 26,385 125,244 0 390
1Dt100g 153 2,104,071 18,022 74,994 0 318
2Dt1e 149 2,332,403 22,011 70,689 0 344
3Dt1c 86 3,636,266 63,365 275,317 0 320
3Dt2c 84 3,819,610 69,733 247,364 0 190
3Dt2j 75 2,813,977 71,970 195,248 0 350
2Dt1l 16 3,996,506 345,773 1,089,026 0 289
1Dt2d 6 5,345,294 1,019,507 2,288,975 0 225
1Dt1h 4 4,336,578 1,857,409 2,051,556 0 404
1Dt100h 2 2,379,473 2,349,045 2,349,045 0 224
Source: flye.log files produced in flye_asm analysis, August 27th 2024

This is a table of the type of stuff i first collated in the Summer, it shares many similarities and cross overs with Table 2, neither have quality statistics, but both are metadata largely concerning the length and fragment, a marked difference is this one comes with a concrete value of coverage. Comparative analysis of them feels like a thing that would happen in the conclusion to this document.

Table 4: CheckM2

Table 4 - quality analysis from CheckM2
based on samples taken from the skin microbiome of Dendrobates tinctorius
Accession Completeness Contamination Completeness model used Translation table used Additional notes
1Dt1h 100.00 0.31 Neural Network (Specific Model) 11 None
1Dt2d 100.00 0.87 Neural Network (Specific Model) 11 None
2Dt1l 100.00 0.27 Neural Network (Specific Model) 11 None
3Dt2c 99.94 0.39 Neural Network (Specific Model) 11 None
3Dt2j 99.80 0.00 Neural Network (Specific Model) 11 None
1Dt100h 99.12 0.00 Gradient Boost (General Model) 11 None
3Dt1c 98.34 0.86 Neural Network (Specific Model) 11 None
3Dt2h 86.04 0.10 Neural Network (Specific Model) 11 None
1Dt100g 78.42 0.33 Neural Network (Specific Model) 11 None
2Dt1e 48.29 0.08 Neural Network (Specific Model) 11 None
Source: CheckM2 predict performed on 2024-12-28

This is the table of CheckM2 analysis, there are many differences in which fields are produced, also, there appears to be differences in the values themselves for Completeness and Contamination, perhaps part of the comparitive analysis, done in the conclusion should be the date at which both were at, as older versions will probably have less accurate values, and this may cause the difference as genomics is a fast evolving field. A Noteable difference is that CheckM used an internal diamond database, whereas CheckM2 required me to download my own. Exact comparison will be conducted in the conclusion.

Conclusion

Thus far, i think that i got all of the fields in Table 2 in a more reliable format in Table 1, but with the quality checks as well, both tables suffer weirdness with “cov.” and “GC” are they the same? are they both coverage? it is as mysterious as it is annoying. Let’s see how they match up to CheckM2. It is getting late, i think that tomorrow is going to be the comparitive analysis and then moving onto the fun-looking checkm2 stuff.

📌 ?: CONCLUSION: how do comparison of all 4 tables next to each other without creating a monstrocity of indented code / something about pills/ find date difference in version history between CheckM and CheckM2 comparitive analysis/

2024-12-29 conclusion to tables + prelim investigations into funky checkm stuff

Introduction

Due to this section being largely about the conclusion from the last week, the other sections may be light, as this is largely just a wrap up i was too tired to do yesterday.

Methods

One thing that might be relevent to note is that i changed the Markdown file so that the tables generated above now come from .csv output files. This helps on cutting down the time taken to knit the document and means that there are not multiple instances of code across the document, as the results section down here will include all 4 tables in one form or another. (basically, this increases flexibility of how the data can be used). In order to find the version history for CheckM, i went here ______. For CheckM2, i went here ______ and looked at the comit history as the version on Hawk is a developer version, not one of the official release, beginning at ____.

Results

Comparison of CheckM and CheckM2

Table 5 - Comparison of quality metrics
for the same fasta files between CheckM/1.1.3 and CheckM2/0.1.3
Accession
CheckM/1.1.3
CheckM2/0.1.3
Completeness Contamination Completeness Contamination
1Dt100g 76.77 0.00 78.42 0.33
1Dt100h 98.84 0.19 99.12 0.00
1Dt1h 100.00 0.00 100.00 0.31
1Dt2d 100.00 0.90 100.00 0.87
2Dt1e 44.74 0.00 48.29 0.08
2Dt1l 98.82 2.32 100.00 0.27
3Dt1c 97.23 1.19 98.34 0.86
3Dt2c 95.75 0.85 99.94 0.39
3Dt2h 87.98 0.51 86.04 0.10
3Dt2j 96.19 1.21 99.80 0.00
Source:
CheckM lineage_wf performed on 2024-12-25
CheckM2 predict performed on 2024-12-28

📌 ?: AFTER LUNCH: date the programs, do the comparison for the “raw”/ no package analyses. / look in log files for diamond version for M and M2

All tables

Flye assembly

Table 2 - Data obtained from flye outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Number of contigs Total length Largest fragment Mean fragment length
3Dt2h 166 3,006,227 125,244 18,109.80
1Dt100g 153 2,104,071 74,994 13,752.10
2Dt1e 149 2,332,403 70,689 15,653.71
3Dt1c 86 3,636,266 275,317 42,282.16
3Dt2c 84 3,819,610 247,364 45,471.55
3Dt2j 75 2,813,977 195,248 37,519.69
2Dt1l 16 3,996,506 1,089,026 249,781.62
1Dt2d 6 5,345,294 2,288,975 890,882.33
1Dt1h 4 4,336,578 2,051,556 1,084,144.50
1Dt100h 2 2,379,473 2,349,045 1,189,736.50
Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024
coverage could not be calculated from this method as no result matched what was found in the flye.log files

Flye log

Table 3 - Data obtained from flye.log outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Fragments Total length Fragments N50 Largest fragment Scaffolds Mean coverage
3Dt2h 166 3,006,227 26,385 125,244 0 390
1Dt100g 153 2,104,071 18,022 74,994 0 318
2Dt1e 149 2,332,403 22,011 70,689 0 344
3Dt1c 86 3,636,266 63,365 275,317 0 320
3Dt2c 84 3,819,610 69,733 247,364 0 190
3Dt2j 75 2,813,977 71,970 195,248 0 350
2Dt1l 16 3,996,506 345,773 1,089,026 0 289
1Dt2d 6 5,345,294 1,019,507 2,288,975 0 225
1Dt1h 4 4,336,578 1,857,409 2,051,556 0 404
1Dt100h 2 2,379,473 2,349,045 2,349,045 0 224
Source: flye.log files produced in flye_asm analysis, August 27th 2024

CheckM

Table 1 - subset of data obtained from CheckM analysis
of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality.
Accession Marker lineage Completeness Contamination GC GC std Genome size contigs Longest contig Coding density Predicted genes
1Dt1h o__Actinomycetales 100.00 0.00 65.519% 4.284% 4,336,578 4 2,051,556 90.334% 3,818
1Dt2d f__Enterobacteriaceae 100.00 0.90 53.889% 1.270% 5,345,294 6 2,288,975 87.931% 4,937
1Dt100h o__Actinomycetales 98.84 0.19 52.446% 0.564% 2,379,473 2 2,349,045 89.586% 2,071
2Dt1l o__Actinomycetales 98.82 2.32 67.742% 0.844% 3,996,506 16 1,089,026 93.707% 3,915
3Dt1c o__Sphingomonadales 97.23 1.19 66.487% 1.385% 3,636,266 86 275,317 90.660% 3,433
3Dt2j o__Sphingomonadales 96.19 1.21 65.644% 1.752% 2,813,977 75 195,248 92.779% 2,786
3Dt2c o__Sphingomonadales 95.75 0.85 66.093% 1.604% 3,819,610 84 247,364 89.769% 3,516
3Dt2h o__Actinomycetales 87.98 0.51 70.327% 1.317% 3,006,227 166 125,244 90.324% 2,978
1Dt100g o__Sphingomonadales 76.77 0.00 68.105% 1.332% 2,104,071 153 74,994 92.645% 2,179
2Dt1e o__Actinomycetales 44.74 0.00 71.162% 1.565% 2,332,403 149 70,689 89.402% 2,215
Source: CheckM lineage_wf performed on 2024-12-25

CheckM2

Table 4 - quality analysis from CheckM2
based on samples taken from the skin microbiome of Dendrobates tinctorius
Accession Completeness Contamination Completeness model used Translation table used Additional notes
1Dt1h 100.00 0.31 Neural Network (Specific Model) 11 None
1Dt2d 100.00 0.87 Neural Network (Specific Model) 11 None
2Dt1l 100.00 0.27 Neural Network (Specific Model) 11 None
3Dt2c 99.94 0.39 Neural Network (Specific Model) 11 None
3Dt2j 99.80 0.00 Neural Network (Specific Model) 11 None
1Dt100h 99.12 0.00 Gradient Boost (General Model) 11 None
3Dt1c 98.34 0.86 Neural Network (Specific Model) 11 None
3Dt2h 86.04 0.10 Neural Network (Specific Model) 11 None
1Dt100g 78.42 0.33 Neural Network (Specific Model) 11 None
2Dt1e 48.29 0.08 Neural Network (Specific Model) 11 None
Source: CheckM2 predict performed on 2024-12-28

Table 5 is a direct comparison of each module’s ability to calculate Completeness and Contamination for the same samples. Each column has a data colourisation, the scale for Completeness was 90-100%, so anything appearing in grey i consider “low quality”, the scale for Contamination is 0-3%. Both analyses agree that 3 of the ten accessions (1Dt100g, 2Dt1e and 3Dt2h) do not meet the 90% threshold, and only slightly differ on exact value. 2Dt1e is of special note because it is the only sample below 50% completeness. All values of Completeness are roughly the same, with CheckM2 having slightly higher results across the board. This contrasts with the values of Contamination, in which values differ quite dramatically, with small results getting larger and large results getting smaller. This is seen most in the results for 2Dt1l, in which the CheckM contamination states 2.32%, but this falls sharply in CheckM2 to 0.27%, coincidentally this sample also sees a rise in Completeness from 98.82% to 100%, so perhaps different genes or sequences are differently classified between the two systems. Overall, Contamination is lower in CheckM2 than CheckM.

However, Completeness and Contamination are not the only fields that these modules produce. CheckM returns with a wide array of extra data about the samples (as well as the extra stuff i will discuss in the next section, if that path leads anywhere) including; taxonomy to the order level, accession 1Dt2d even got identified to the family level, as well as many metrics for fragment count and size. It is important to remember that as an undergraduate student i don’t have a full understanding of all of these fields, and this is a subset of the data based upon what i thought was important. This contrasts with CheckM2 which outputs no additional data, however, it does include information about the “Completeness model” and “Translation table” used, which could be useful in fact-checking the results of the analysis.

A secondary analysis i wanted to perform was between the results of the CheckM analysis against that of the manual “assembly_info.txt” and “flye.log” analyses, as they all share overlaps and this could inform future reasearch as to which method is best so all do not need to be performed. All three agree on the number of contigs or fragments (which are either pseudonyms or used interchangeably), as well as the length of the genome and “largest fragment”/ “longest contig”. This is where differences arise in what addition fields each method reveals. The assembly_info.txt files give the mean fragment length, though i am not sure how relevant this is. The flye.log files contain information on “N50”, scaffolds and Mean coverage, i am not sure of the purpose of N50, scaffolds is only ever 0 for all, however mean coverage is perhaps a useful statistic only found in this analysis. Finally, CheckM contains information on “GC”, which i first took to mean Genome Coverage, but upon viewing this page, i discovered meant the proportion of Guanine and Cytosine to Adenine and Thymine, thus GC. This page also contains useful definitions for other fields i did not include, and even some for CheckM2. Other extra fields include coding density and predicted genes, potentially useful.

Conclusion

For the CheckM vs CheckM2 analysis, the version used for CheckM is older, version 1.1.3 was released on the 9th of July 2020, whereas version 0.1.3 of CheckM2 was created on the 18th of July 2022. However, it should be noted both versions found on hawk are out of date from the modern, with the latest version of CheckM being 1.2.3, released on the 25th of June 2024 and the latest version of CheckM2 is version 1.0.2, released on the 19th of May 2023. It would appear that CheckM2 is not a replacement for CheckM, as CheckM is more up to date than CheckM2. Both used their own internal DIAMOND database, dating these is very difficult, so the age of the programs themselves is the only thing i have to go on.

From this information and my analysis of the tables i generated, i conclude that CheckM is the most useful analysis, returning the largest amount of information about a sample and being the most up-to-date. It is difficult to determine which of CheckM and CheckM2 give more reliable results for completeness and contamination, however, if the impact of the analysis had real world implications, i would run them both in parallel due to CheckM2’s more focused nature, this should not add much time to the generation process as both are reletively quick and produce outputs easy to tabulate and analysise. I can confirm that CheckM is a good alternative to manually turning files in the Flye_asm output directories into tables, those ones entailed much difficulty to tabulate and contain less data. However, if there is a situation where coverage is mandatory, the flye.log file is the only one to give a reliable value.

In conclusion, i believe CheckM is the best overall analysis, giving results for all 3 extra analysis combined, in reference to CheckM2 they largely agree in all areas, but if my results had real world consequences i would run both for the conformation, though i can find no way of discerning which method is more accurate. Perhaps in future i can ask for the most modern versions of each to be downloaded to hawk and conduct a comparison on them then. My next projects will be conducted on areas around CheckM due to the discovery of graphical outputs and links to KO pathways, which i hope to yield interesting results. When term restarts i will have to discuss with Aaron the implications for this part of the study. i.e. what does the results of Completeness and contamination mean for our samples? or are these just things we want to know about them?

📌 TODO: (after conc of last bit) investigations into KO and protein stuff / just found out checkm is better documented and has some graph output options, so that is something to look into as well / tree_qa command (fun checkM stuff abound). / dendrogram?

2024-12-30 / 2024-12-31 checkm tree_qa

Introduction

In the CheckM documentation for lineage_wf, there are 6 functions, 5 are categorised as M (Mandetory), however a 6th “checkm tree_qa” is marked as “R” and is stated to be optional, looking closer at this function, it has 5 different potential outputs that look interesting, so i figured i would give it a go and see what i could get out of it, sort of exploratory.

Methods

The vast majority of the code required was already used for the original checkm bash script, so i copied it and made the modifications, producing this file. I could not figure out a way to run all 5 functions at once, so i had to go into the file and change the name (because these were all outputting to one place so naming the function was required to distinguish)and -o function(how to select the desired output) for each iteration. This was not an issue however as when run as a slurm job on hawk, none of the options took longer than 10 seconds to complete. In order to turn the output from option 2 into a table, i created this R script, which outputs this file. I am not sure if i mentioned it above, but i found a way to do the formatting for my tables in this document, which frees up memory and reduces load times, by loading in a .csv of the cleaned data, instead of cleaning the data and doing the whole script in R, which also creates multiple instances. Basically, it was a whole thing, but now it is not a problem because of some clever coding.

viewClade function

Results

I created an output directory to store all the outputs. As mentioned, there were five; 2 are metadata tables which contain taxonomy data, one basic and one with more data. The next two are .tree files which place the samples in a tree with online references from the IMG database, one had the ID on that system, in the other the tip labels contained the full taxonomy of each sample. It is possible i can use these for a comparative analysis of CheckM and gtdbtk or ncbi and img. The fifth and final output is: “multiple sequence alignment of reference genomes and bins which can be used to infer a de novo genome tree”. It looks to be a fasta file for proteins in each sequence, as of now im unsure what to do with it, i recognise de novo from gtdbtk, is there a link? i will have to consult Aaron for this one.

section finished on 2024-12-31

Table 6 - subset of data obtained from secondary CheckM analysis
of samples taken from the skin microbiome of Dendrobates tinctorius, including taxonomy at various levels
Accession Unique Markers (of 43) Taxonomy (Contained) Taxonomy (Sister Lineage)
1Dt100g 41 k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales f__Sphingomonadaceae→g__Sphingomonas
1Dt100h 42 k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales f__Dermabacteraceae
1Dt1h 42 k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Brevibacteriaceae→g__Brevibacterium s__Brevibacterium_linens
1Dt2d 43 k__Bacteria→p__Proteobacteria→c__Gammaproteobacteria→o__Enterobacteriales→f__Enterobacteriaceae→g__Pantoea s__
2Dt1e 32 k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Dermabacteraceae g__Brachybacterium→s__Brachybacterium_faecium
2Dt1l 42 k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Microbacteriaceae→g__Microbacterium unresolved
3Dt1c 43 k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales f__Sphingomonadaceae→g__Sphingomonas
3Dt2c 43 k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales f__Sphingomonadaceae→g__Sphingomonas
3Dt2h 42 k__Bacteria→p__Actinobacteria→c__Actinobacteria→o__Actinomycetales→f__Microbacteriaceae→g__Microbacterium s__Microbacterium_testaceum
3Dt2j 43 k__Bacteria→p__Proteobacteria→c__Alphaproteobacteria→o__Sphingomonadales f__Sphingomonadaceae→g__Sphingomonas
Source: CheckM tree_qa performed on 2024-12-30

This is a subset of output 2 showing some taxonomic assignments for the samples, as well as “unique markers”. 9 are above 40, with only 2Dt1e falling below at 32. Output 1 is a simplified version of Output 2, with the same main fields but missing some found in 2. As mentioned, this is a subset, due to space, and some difficulties in coding, i only took the unique fields. The output table also contained data on number of genes and fragment count, GC etc that is already shown in Table 1: CheckM analysis. As well as this, there were some extra fields focussing on the mean values of each field for the groupings each sample were in, i beleive this is the final value in “Taxonomy (Contained)”, again because of space and relevance i cut them from the final product.

Tree_qa Phylograms

Sphingomonas samples

Figure 2: Phylogram for 4 samples known to be of the genus *Sphingomonas*. Generated by checkm tree_qa on the 30th of December 2024

Figure 2: Phylogram for 4 samples known to be of the genus Sphingomonas. Generated by checkm tree_qa on the 30th of December 2024

Microbacterium samples

Figure 3: Phylogram for 2 samples known to be of the genus *Microbacterium*. Generated by checkm tree_qa on the 30th of December 2024.

Figure 3: Phylogram for 2 samples known to be of the genus Microbacterium. Generated by checkm tree_qa on the 30th of December 2024.

Brachybacterium / Dermabacter sample

Figure 4: Phylogram for 1 sample known to be of the genus *Brachybacterium*. Generated by checkm tree_qa on the 30th of December 2024.

Figure 4: Phylogram for 1 sample known to be of the genus Brachybacterium. Generated by checkm tree_qa on the 30th of December 2024.

Brevibacterium sample

Figure 5: Phylogram for 1 sample known to be of the genus *Brevibacterium*. Generated by checkm tree_qa on the 30th of December 2024.

Figure 5: Phylogram for 1 sample known to be of the genus Brevibacterium. Generated by checkm tree_qa on the 30th of December 2024.

Pantoea sample

Figure 6: Phylogram for 1 sample of disputed genus. Generated by checkm tree_qa on the 30th of December 2024.

Figure 6: Phylogram for 1 sample of disputed genus. Generated by checkm tree_qa on the 30th of December 2024.

Outputs 3 and 4 produce the same tree files, with some minor differences in labeling, so i have only outputted the results from the one that returns the full taxonomic ID of each sample in the tip label. The trees look to be trustworthy, confirming previous taxonomic assignments i have run. It should be noted these use the IMG database, not the NCBI database, this looks to be a smaller database as many of our samples are outgroups on the trees with no direct sister samples, this can be seen most for Figure 2, with all 4 samples in the outgroup. This analysis does confirm again that sample 1Dt2d identified first as “Enterobacter cancerogenus” is in fact a species of the genus “Pantoea”. As mentioned above i am yet to figure out if i can do anything with the output from option 5, so i will discuss with Aaron when term time begins again. Many of the samples cannot be given a species name by this analysis, for example, all of the samples in the genus Sphingomonas are closest to each other, so it is hard to discern if they are of different species.

Conclusion

Output 2, and to a lesser extent 1, present interesting data, it is not as useful as the data produced by the main analysis, but i think it was worth doing. It is possible i am not fully apreciating all of the outputs of this. Only very few of the samples get a taxonomy to the species level, however, it does confirm that sample 1Dt2d, marked “Enterobacter cancerogenus” by early taxonomic analysis actually belongs to the genus Pantoea. Confirming the analysis done all the way back in the technical spike for gtdbtk de novo Conclusion. This is mirrored in the outputs of the phylogenetic trees, which are good, but i doubt are a suitable replacement for the ones output by gtdbtk, which i will get onto at some point in the future. I have mentioned 5 several times, so i will just say that i might come back to look at it later. I learned some new things vis a vis new functions to tabulate like changing text size and pruning/collapsing trees in R, so i am less reliant on dendroscope, and have less tree files floating around. Overall, i think this has been a useful thing to do, even if the outputs are not as important or useful as what i have done or am about to do, even just for the experience alone, but who knows, maybe there is something in this data that will be useful to someone reading this.

📌 TODO: investigations into KO and protein stuff / just found out checkm is better documented and has some graph output options, so that is something to look into as well / tree_qa command (fun checkM stuff abound). / dendrogram? / did i ever link to the gt() documentation? / ggtree heatmaps?

2025-01-01 / 2025-01-02 CheckM2 DIAMOND_RESULTS.tsv

Introduction

I decided to take a look at the full outputs of CheckM2 to see if there was anything interesting to analyse other than the Completeness and Contamination, i found this file called DIAMOND_RESULTS.tsv. It has a bunch of weird stuff in it, i did however recognise the existence of KO pathways in the second column, along with links to a website that displays 3D images of proteins, so i decided to take a closer look at this file.

Methods

The file did not output with column headers, i believe the ones found here match up well, but the descriptions of some are vague, so i am not sure what the significance of a few of the columns is. Due to the size of DIAMOND_RESULTS.tsv, i decided not to include it as a table or image in this document. I created this R script to import the data to a table and do an analysis of the important looking fields. This was quite a difficult task that made this little project stretch over 2 days. The main output i wanted from this was a comparison between the KO pathways outputted from this and those outputted by Eggnog-mapper. Eggnog-mapper still is not working on Hawk so maybe we could get the KO pathways for the heatmaps through CheckM2 instead. I did this analysis at two levels, a comparison of which unique KOs each system recognised, and the “cumulitive” or total differences.

Results

The original file is very large, so i cannot include it in the document, there is a line for each gene in all 10 samples. Here is a screenshot of it in Notepad ++

Figure 7 - Subset of DIAMOND_RESULTS.tsv file without headers
Figure 7 - Subset of DIAMOND_RESULTS.tsv file without headers
Table 7 - Unique KO pathway analysis from CheckM2
based on samples taken from the skin microbiome of Dendrobates tinctorius
Accession Total unique KOs KO # in DIAMOND file KO # from eggNog-mapper files % diff DIAMOND - eggNog % diff eggNog - DIAMOND
1Dt100g 1160 1096 1034 0.9448276 0.8913793
1Dt100h 1179 1055 1107 0.8948261 0.9389313
1Dt1h 1627 1505 1466 0.9250154 0.9010449
1Dt2d 2856 2631 2702 0.9212185 0.9460784
2Dt1e 966 828 854 0.8571429 0.8840580
2Dt1l 1641 1471 1467 0.8964046 0.8939671
3Dt1c 1641 1572 1438 0.9579525 0.8762949
3Dt2c 1654 1582 1447 0.9564692 0.8748489
3Dt2h 1411 1254 1280 0.8887314 0.9071580
3Dt2j 1475 1428 1302 0.9681356 0.8827119
Source: DIAMOND_RESULTS.tsv produced by CheckM2 predict performed on 2024-12-28
Table 8 - KO pathway Total Frequency analysis from CheckM2
based on samples taken from the skin microbiome of Dendrobates tinctorius
Accession EggNog-mapper most-frequent KO pathways Ocurrance number in genome accession.d DIAMOND most-frequent KO pathways Ocurrance number in genome
1Dt100g K00799 8 1Dt100g K00249 11
1Dt100g K02014 6 1Dt100g K00059 8
1Dt100g K00626 4 1Dt100g K02014 8
1Dt100g K00666 4 1Dt100g K00799 7
1Dt100g K11177 4 1Dt100g K01069 4
1Dt100h K02015 8 1Dt100h K02015 8
1Dt100h K01990 7 1Dt100h K01990 6
1Dt100h K02013 6 1Dt100h K02013 6
1Dt100h K02016 6 1Dt100h K02016 6
1Dt100h K02031 6 1Dt100h K02031 5
1Dt1h K02016 18 1Dt1h K00059 21
1Dt1h K00059 13 1Dt1h K02016 18
1Dt1h K01990 12 1Dt1h K02031 11
1Dt1h K01992 11 1Dt1h K02032 11
1Dt1h K02031 11 1Dt1h K07749 11
1Dt2d K02029 28 1Dt2d K18900 21
1Dt2d K03406 21 1Dt2d K07495 16
1Dt2d K02031 17 1Dt2d K00059 15
1Dt2d K02028 14 1Dt2d K02029 15
1Dt2d K02032 14 1Dt2d K02529 15
2Dt1e K02026 20 2Dt1e K02026 19
2Dt1e K02027 18 2Dt1e K02025 18
2Dt1e K02025 15 2Dt1e K02027 17
2Dt1e K06147 9 2Dt1e K02529 12
2Dt1e K02529 8 2Dt1e K06147 9
2Dt1l K02034 30 2Dt1l K02529 30
2Dt1l K02035 28 2Dt1l K02031 27
2Dt1l K02031 26 2Dt1l K02035 25
2Dt1l K02033 26 2Dt1l K02032 24
2Dt1l K02025 22 2Dt1l K02033 24
3Dt1c K02014 24 3Dt1c K02014 40
3Dt1c K03406 14 3Dt1c K03406 16
3Dt1c K14266 11 3Dt1c K14266 14
3Dt1c K02529 7 3Dt1c K02529 8
3Dt1c K03088 7 3Dt1c K05349 8
3Dt2c K02014 23 3Dt2c K02014 35
3Dt2c K03406 17 3Dt2c K07497 18
3Dt2c K14266 12 3Dt2c K03406 17
3Dt2c K02483 7 3Dt2c K14266 17
3Dt2c K03496 7 3Dt2c K00059 8
3Dt2h K02529 14 3Dt2h K02529 18
3Dt2h K02025 13 3Dt2h K02025 11
3Dt2h K02026 13 3Dt2h K02026 11
3Dt2h K02027 12 3Dt2h K02027 11
3Dt2h K03088 9 3Dt2h K03088 10
3Dt2j K02014 12 3Dt2j K02014 21
3Dt2j K00626 6 3Dt2j K00249 9
3Dt2j K00799 6 3Dt2j K00059 8
3Dt2j K07107 5 3Dt2j K00799 8
3Dt2j K03088 4 3Dt2j K01897 7
Source: DIAMOND_RESULTS.tsv produced by CheckM2 predict performed on 2024-12-28

Tables 7 and 8 represent my analysis of the DIAMOND_RESULTS.tsv file. I decided to do this by comparing it to the outputs i got from eggNog-mapper when i ran it earlier in the year. Table 7 represents a comparison of unique KO pathways. Often, the same KO pathway is found multiple times in the same genome, this makes evolutionary sense but if CheckM2 is a good alternative to eggNog-mapper, which does not work on hawk yet then it must find similar KO pathways. Table 7 would confirm this as similarities between output files consistently stay above 85%. Table 8 represents a comparison of highest frequency, taken as the top 5 most frequent for each genome. Basically, what are each system finding most often. Often, there will be similarities, with values only off by 2 or 3, however, occasionally the most common in DIAMOND will be absent from the list for eggNog-mapper and vise-versa.

Conclusion

This analysis has been interesting, it is possible there is more here i am not fully understanding, but the resources to fact check and better understand are few and far between, so maybe Aaron will have something to say when he gets here. Table 7 suggested that the KO pathways being found are very similar. However, Table 8 suggests the opposite in most cases when repeats are taken into account. Ultimately, i do not think CheckM2 is a replacement for eggNog-mapper as this analysis is not its primary function. On the other hand, eggNog-mapper on hawk does not work at all, so i am stuck there, and if i have the time i may just use CheckM2 anyway to see what happens, i will make note that the results may not be entirely accurate if i take this path however.

2025-01-03 Statistical analysis of yesterdays work (finished on 2025-01-05)

Introduction

I figured if i am going to make decisions based on if two methods are different, i should compare them statistically instead of eyeballing it. It is simple enough, tests, graphs, that sort of stuff. This lead me to do many things in R, but only some of the analyses i did were valid.

Methods

I started off testing which KO values against which accessions in each file, excluding repetitions. However, i then realised that this methodology did not give a full picture of the results, so i decided to work on trying to identify when duplicates occured and add them to the dataset. After much trial and error to get right, i created this file. I included all the dead-ends as comments at the end for posterity. I finally decided to use the shared KO values divided by the total for each accession and method to create my values showing the differences between DIAMOND and eggNog-mapper. I then decided to run a statistical analysis to see if the differences observed between groups was significant.

Results

Table 9 - Shared KO values between two systems represented as proportions of their totals
based on samples taken from the skin microbiome of Dendrobates tinctorius
Accession % of shared KO values present in DIAMOND_RESULTS.tsv % of shared KO values present in eggNog-mapper outputs
1Dt100g 84.9% 92.0%
1Dt100h 91.5% 84.9%
1Dt1h 84.7% 87.3%
1Dt2d 88.4% 82.8%
2Dt1e 82.6% 82.1%
2Dt1l 83.4% 85.4%
3Dt1c 82.1% 94.1%
3Dt2c 80.7% 93.7%
3Dt2h 86.7% 83.4%
3Dt2j 84.1% 94.9%
Source: DIAMOND_RESULTS.tsv produced by CheckM2 predict performed on 2024-12-28 and the series of eggNog-mapper outputs i generated using eggNog-mapper v2 for the web 2024-08-28
shapiro.test(groupby$DIAMOND_per)
## 
##  Shapiro-Wilk normality test
## 
## data:  groupby$DIAMOND_per
## W = 0.94467, p-value = 0.6061
shapiro.test(groupby$eggNog_per)
## 
##  Shapiro-Wilk normality test
## 
## data:  groupby$eggNog_per
## W = 0.8645, p-value = 0.08619
compttest <- t.test(groupby$DIAMOND_per, groupby$eggNog_per)
compttest
## 
##  Welch Two Sample t-test
## 
## data:  groupby$DIAMOND_per and groupby$eggNog_per
## t = -1.6563, df = 15.199, p-value = 0.1182
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07200294  0.00899259
## sample estimates:
## mean of x mean of y 
## 0.8490075 0.8805126

Table 9 shows that the common KO values represent a very large proportion of the total KO values for both methods and all accessions, never dipping below 80%. Shapiro testing showed that a parametric test could be used for this. I decided on a Student’s t-test(t(15.199) = -1.6563, p = 0.1182). As the p value is larger than 0.05, we accept the null hypothesis that the groups are not significantly different.

Conclusion

I believe that my work over the past few days proves that if i have time between doing the exams, i can use CheckM2 as an alternative to eggNog-mapper, because as of now, eggNog-mapper still does not function on hawk. However, when that situation gets sorted out i will also do analyses using it and then do something very similar to what i did here. If i cannot get there before eggNog-mapepr is fixed, then i will only use that method.

#—————————————————————————————————— # Continued in Notebook #2